Predicted and Verified Deviation from Zipf's Law in Growing Social Networks
Zipf's power law is a general empirical regularity found in many natural and social systems. A
recently developed theory predicts that Zipf's law corresponds to systems that are growing according
to a maximally sustainable path in the presence of random proportional growth, stochastic birth
and death processes. We report a detailed empirical analysis of a burgeoning network of social
groups, in which all ingredients needed for Zipf's law to apply are verifiable and verified. We
estimate empirically the average growth $r$ and its standard deviation $\sigma$ as well as the death rate
$h$ and predict without adjustable parameters the exponent $\mu$ of the power law distribution $P(s)$ of
the group sizes $s$. The predicted value $\mu = 0.75 \pm 0.05$ is in excellent agreement with maximum
likelihood estimations. According to theory, the deviation of $P(s)$ from Zipf's law (i.e., $\mu < 1$)
constitutes a direct statistical quantitative signature of the overall non-stationary growth of the
social universe.

PACS numbers: 05.40.-a,05.45.Xt, 89.65.-s 



Power law distributions,

$$p(s) \sim \frac{1}{s^{1+\mu}}, \qquad (1)$$

are ubiquitous characteristics of many natural and social systems. The function $p(s)$ is the
density associated with the probability $P(s) = \Pr\{S > s\}$ that the value $S$ of some stochastic
variable, usually a size or frequency, is greater than $s$. Among power law distributions, Zipf's
law states that $\mu = 1$, i.e., $P(s) \sim s^{-1}$ for large $s$. Zipf's law has been reported for
many systems [1], including word frequencies [2], firm sizes [3], city sizes [4], connections
between Web pages [5] and between open source software packages [6], Internet traffic
characteristics [7], abundance of expressed genes in yeast, nematodes and human tissues [8] and
so on. The apparent ubiquity and universality of Zipf's law has triggered numerous efforts to
explain its validity. Deviations from Zipf's law also provide important insights into the
dynamics of the corresponding systems.

Since H. Simon's pioneering work [9-11], a crucial ingredient in the generating mechanism of
Zipf's law is understood to be Gibrat's rule of proportional growth [12], more recently
rediscovered under the name of "preferential attachment" in the context of networks [13].
Expressed in continuous time in terms of the size $S(t)$ of a firm, a city or, more generally, a
social group, Gibrat's rule corresponds to the geometric Brownian motion

$$dS(t) = S(t)\,\bigl(r\,dt + \sigma\,dW(t)\bigr), \qquad (2)$$

where the stochastic growth rate $r + \sigma\,dW/dt$ is decomposed into its average $r$ and its
fluctuation part with amplitude determined by the standard deviation $\sigma$, while $W(t)$ is a
standard Wiener process. Gibrat's rule alone cannot produce (1), since the solution of equation
(2) has a log-normal distribution. Simon and many other authors invoked an additional
ingredient, corresponding to various modifications of the multiplicative process when $S(t)$
becomes small. Then, under very general conditions, the distribution of $S$ becomes a power law,
with an exponent $\mu$ that is a function of the distribution of the multiplicative factors
[14-16].
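
To make the log-normal claim concrete, here is a short worked step. It assumes a fixed initial size $S(0)$ and applies Itô's lemma to $\ln S(t)$ in equation (2):

```latex
d\ln S(t) = \Bigl(r - \tfrac{\sigma^{2}}{2}\Bigr)\,dt + \sigma\,dW(t)
\;\Longrightarrow\;
S(t) = S(0)\,\exp\Bigl[\Bigl(r - \tfrac{\sigma^{2}}{2}\Bigr)t + \sigma W(t)\Bigr],
\qquad
\ln S(t) \sim \mathcal{N}\Bigl(\ln S(0) + \bigl(r - \tfrac{\sigma^{2}}{2}\bigr)t,\; \sigma^{2} t\Bigr).
```

Since $\ln S(t)$ is Gaussian at every fixed $t$, $S(t)$ is log-normal rather than power-law distributed, which is why proportional growth needs to be supplemented by another ingredient.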

The fact that the exponent $\mu$ is often found close to 1 requires another crucial ingredient.
One particularly intriguing proposition is that Zipf's law corresponds to systems that are
growing according to a maximally sustainable path [17]. In other words, when Zipf's law holds,
the set of stochastically growing entities $\{S_i(t), i = 1, 2, \ldots, n, \ldots\}$ is delicately
poised at a dynamical critical growth point. Within a general framework in which (i) entities
are born at random times, (ii) grow stochastically according to (2), and (iii) can disappear or
die according to various stochastic processes with some hazard rate $h$, the explicit
calculation of the exponent $\mu$ confirms the above optimal growth condition associated with
Zipf's law ($\mu = 1$) [17].

Here, we present an empirical test of the optimal growth condition for Zipf's law by testing
the formula for exponent $\mu$ (see below) on a unique database obtained from a Web platform of
collaborative social projects (Amazee.com). In this dataset, we verify empirically that
proportional growth holds, we measure the parameters $r$, $\sigma$, $h$ and the exponent
$\mu_{\rm exp}$ of the power law distribution of project sizes. We show that the theory leading
to the maximum sustainable growth principle explains remarkably well the empirical value
$\mu_{\rm exp}$, with no adjustable parameters.

The theory is based on the following assumptions [1, 17]. Consider a population of social
groups (firms, cities, projects, and so on), which can take different forms and can be applied
in many different contexts.

1. There is a flow of group entries, i.e., a sequence of births of new groups. The times
$\{t_1 < t_2 < \cdots < t_i < \cdots\}$ of entries of new groups follow a Poisson process with
constant intensity (generalizations to a vast class of non-Poisson processes do not modify
the key result [1, 17]).

2. At time $t_i$, $i \in \mathbb{N}$, the initial size of the new entrant group $i$ is a random
variable $s_{0,i}$. The sequence $\{s_{0,i}\}_{i \in \mathbb{N}}$ is the result of independent and
identically distributed random draws from a common random variable $s_0$. All the draws are
independent of the entry dates of the groups.

3. Gibrat's rule of proportional growth holds. This means that, in the continuous time limit,
the size $S_i(t)$ of the $i$-th group at time $t > t_i$, conditional on its initial size
$s_{0,i}$, is the solution to the stochastic differential equation (2), where the drift $r$ and
the volatility $\sigma$ are the same for all groups but the Wiener process $W_i(t)$ is specific
to each project $i$.

4. Groups can exit (disappear) at random, with constant hazard rate $h > 0$, which is
independent of the size and age of the group.
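
The four assumptions above can be turned into a small Monte Carlo experiment. The following is a minimal sketch, not the procedure of [17]: the function names `simulate_sizes` and `hill_tail_index` are hypothetical, the initial size is fixed at `s0 = 1` (a trivial case of i.i.d. draws), sizes are treated as continuous, and the default parameter values are only illustrative orders of magnitude borrowed from the measurements reported later in the text.

```python
import math
import random

def simulate_sizes(rate=2.4, r=0.019, sigma=0.19, h=0.011,
                   horizon=750.0, s0=1.0, seed=0):
    """Sizes of the groups still alive at time `horizon` (in days)."""
    rng = random.Random(seed)
    sizes = []
    t = rng.expovariate(rate)            # assumption 1: Poisson entry times
    while t < horizon:
        age = horizon - t
        lifetime = rng.expovariate(h)    # assumption 4: constant hazard rate h
        if lifetime > age:               # the group has not exited yet
            # assumptions 2-3: fixed initial size s0, then the exact
            # log-normal solution of geometric Brownian motion over `age`
            z = rng.gauss(0.0, 1.0)
            sizes.append(s0 * math.exp((r - 0.5 * sigma ** 2) * age
                                       + sigma * math.sqrt(age) * z))
        t += rng.expovariate(rate)
    return sizes

def hill_tail_index(sizes, k):
    """Hill estimator of the tail exponent from the k largest sizes."""
    top = sorted(sizes, reverse=True)[:k + 1]
    return k / sum(math.log(s / top[k]) for s in top[:k])

sizes = simulate_sizes()
print(len(sizes), "groups alive; Hill estimate of mu:",
      round(hill_tail_index(sizes, k=50), 2))
```

The Hill estimator is used here only as a quick tail diagnostic for continuous sizes; a single finite run gives a noisy estimate, so averaging over many seeds is advisable before comparing with any theoretical value.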

Under these conditions, the central result of [17] reads as follows. With the exponent $\mu$
defined by expression (3), provided that $E[s_0] < \infty$, and for times larger than the
transient time given by expression (4), the average distribution of project sizes follows an
asymptotic power law with tail index $\mu$ given by (3), in the following sense: the average
number of projects with size larger than $s$ is proportional to $s^{-\mu}$ as $s \to \infty$. As
a corollary, the exponent $\mu$ of the distribution of sizes takes the value 1 corresponding to
Zipf's law, if and only if $r = h$.
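
A minimal numerical sketch of the corollary follows. It assumes that the tail index takes the form $\mu = \frac{1}{2}\bigl[(1 - 2r/\sigma^2) + \sqrt{(1 - 2r/\sigma^2)^2 + 8h/\sigma^2}\bigr]$, which is our reading of expression (3) in the simplest setting of [17]. This form is an assumption of the sketch, and the function name `tail_index` is hypothetical.

```python
import math

def tail_index(r, sigma, h):
    """Assumed form of expression (3): tail exponent mu from r, sigma, h."""
    a = 1.0 - 2.0 * r / sigma ** 2
    return 0.5 * (a + math.sqrt(a * a + 8.0 * h / sigma ** 2))

# Corollary: mu equals 1 (up to rounding) when growth balances death, r = h.
print(tail_index(r=0.02, sigma=0.2, h=0.02))
# Growth faster than death (r > h) pushes mu below 1, and vice versa.
print(tail_index(r=0.03, sigma=0.2, h=0.02) < 1.0)
print(tail_index(r=0.01, sigma=0.2, h=0.02) > 1.0)
```

Under this assumed form, setting $r = h$ makes the square root collapse to $1 + 2r/\sigma^2$, so $\mu = 1$ for any $\sigma$. Inserting the last-snapshot values of Table II ($r = 0.019$, $\sigma = 0.19$, $h = 0.011$) gives a value close to the tabulated prediction of 0.75.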

In order to understand the corollary, notice that $r - h$ represents the average growth rate of
an incumbent group. Indeed, considering a group present at time $t$, during the next instant
$dt$, it will either exit with probability $h\,dt$ (and therefore its size changes by
$-100\%$) or grow at an average rate equal to $r\,dt$, with probability $(1 - h\,dt)$. The
coefficient $r$ is therefore the conditional growth rate of projects, conditioned on not having
died yet. Then, the unconditional expected growth rate over the small time increment $dt$ of an
incumbent group is $(r - h)\,dt + O(dt^2)$. The statistically stationary regime, in the presence
of a stationary population of group-forming individuals, corresponds to the condition $r = h$.
Malevergne et al. [17] showed that this condition can be easily generalized to the case where
the population of group-forming individuals itself grows at some exponential rate, as does the
minimal viable group size [17]. Then, this condition translates into that for the maximum
sustainable growth of the universe of groups, as mentioned above.
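
The expected-growth step in this argument can be written out explicitly. This is only a restatement of the two-outcome reasoning of the text, to first order in $dt$:

```latex
E\!\left[\frac{dS}{S}\right]
  = \underbrace{(1 - h\,dt)\,(r\,dt)}_{\text{survives and grows}}
  + \underbrace{(h\,dt)\,(-1)}_{\text{exits}}
  = (r - h)\,dt - r h\,dt^{2}
  = (r - h)\,dt + O(dt^{2}).
```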

Our strategy is to find an empirical dataset in which (i) all ingredients of the theory can be
verified explicitly, (ii) all parameters $r$, $\sigma$ and $h$ can be measured directly and
(iii) the empirical distribution of group sizes can be compared with prediction (1) with (3).
We have found such a database with Amazee.com, which is a Web-based platform of collaboration.
Using Amazee's Web platform, anyone with an idea for a collaborative project can sign in and use
the website to gather followers, who will together help the project owner to accomplish the
project. An Amazee project can involve any type of activity, such as arts and culture,
environment and nature, politics and beliefs, science and innovation, social and philanthropic,
sports and leisure, and so on. Most of the projects are public, for instance, "build a strong
community of Internet entrepreneurs in Switzerland to exchange information and have fun" (Web
Monday Zurich), "connect all women working in the Swiss ICT industry" (Tech Girls Switzerland),
"to provide fresh running water to each home in the small African village of Dixie" (Water for
Dixie), and so on. Amazee.com provides a set of features covering the entire lifetime of a
typical project, such as project planning, participant recruiting, fund raising, event and
meeting hosting, communication, file archiving, and so on. Users join Amazee.com by either
creating a new project, or participating in projects created by others. The Amazee data we
analyze contain the complete recording in time of the activities of all users creating and
joining all the projects in existence from February 2008 till May 2010.

Projects can be seen as proxies of many naturally oc- 
curring entities, such as social groups, firms, cities, in- 
vestment vehicles, and so on, each driven by some goal, 
competition, and interaction within social networks. The 
detailed knowledge of the activity of the participants of 
all projects provides a remarkable opportunity to dissect 
and understand the dynamics of such systems. In the 
present study, we restrict our attention to the simplest 
measure of size, namely the number $S_i$ of members of
project i. 

Amazee's platform started in February 2008. We analyze four snapshots of the database, on
7 August 2008, on 8 February 2009, on 7 August 2009 and on 8 March 2010. The first snapshot is
six months after the birth of the operations on Amazee.com. With the parameter values for $r$,
$\sigma$ and $h$ determined below, formula (4) predicts a transient of 50-400 days. Therefore,
except for the first snapshot, we should observe a reasonable convergence to the expected power
law distribution.

Table I and Fig. 1 confirm that the distributions of project sizes obtained for these four
snapshots are power laws (1) (p > 0.05 of a Kolmogorov-Smirnov (K-S) test for the three last
snapshots), with some significant deviation only for the earliest snapshot. This deviation can
be interpreted as only a partial convergence to the stationary growth regime, confirmed by the
much smaller maximum size observed in the first snapshot. K-S tests on the same data for
goodness of fit using competing distributions such as the exponential distribution and the
log-normal distribution yield p-values less than 0.01, confirming the power law as the best
model.

TABLE I: Descriptive statistics of the sizes of Amazee's projects at different times, showing
that most projects have a size of just a few individuals while a few projects have hundreds to
more than one thousand members. Dates are in format day/month/year.

                      07.08.2008   08.02.2009   07.08.2009   08.03.2010
Number of projects           436         1165         1562         1829
Mean size                      5           10            9            8
Minimum size                   1            1            1            1
Maximum size                  83         1110         1114         1120
Median size                    2            2            2            1

[Figure 1: four panels titled "Data snapshot on 2008-08-07", "Data snapshot on 2009-02-08",
"Data snapshot on 2009-08-07" and "Data snapshot on 2010-03-08"; horizontal axis: sizes of
Amazee projects.]

FIG. 1: Blue dots: Probability distributions of Amazee project sizes measured for four
snapshots on 7 August 2008, on 8 February 2009, on 7 August 2009 and on 8 March 2010. The
maximum likelihood fits are shown by the green lines, with exponents $\mu$ respectively equal
to 0.64, 0.71, 0.74, 0.76.

[Figure 2: panel titled "Proportional growth of Amazee projects"; horizontal axis: sizes of
Amazee projects.]

FIG. 2: Test of Gibrat's law for the proportional growth of Amazee project sizes until 8 March
2010. The slopes of the fitted straight lines are exactly 1.

Because the numbers of project members are integers, the exponents $\mu$ corresponding to the
empirical distributions shown in Fig. 1 are estimated using the maximum likelihood (ML) method
with the normalized discrete version of (1), $p(s) = s^{-(1+\mu)} / \zeta(1+\mu)$, where
$\zeta(x)$ is the Riemann zeta function: $\zeta(x) = \sum_{n=1}^{\infty} n^{-x}$. The exponents
are found around 0.7, with confidence intervals clearly excluding 1. We check the robustness of
this conclusion by estimating the exponents $\mu$ for the four snapshots as a function of a
lower threshold above which the MLE is performed. For the three last snapshots, we find stable
estimations, with the 95% confidence intervals excluding the value 1. We can thus conclude that
Zipf's law is rejected for this dataset.
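
The discrete maximum likelihood fit and the threshold scan described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code. The helper names (`hurwitz_zeta`, `log_likelihood`, `fit_mu`) are hypothetical, the zeta function is approximated by a truncated sum with an Euler-Maclaurin tail, the maximum is located by golden-section search, and `toy_sizes` is an invented sample rather than Amazee data. For a lower threshold `s_min`, the normalization assumed here is the shifted sum $\sum_{k \ge 0} (s_{\min} + k)^{-(1+\mu)}$.

```python
import math

def hurwitz_zeta(a, q=1, n_terms=1000):
    """Approximate sum_{k>=0} (q + k)^(-a) for a > 1.

    Direct summation of the first n_terms, plus an Euler-Maclaurin
    estimate (integral + half of the first omitted term) for the rest.
    """
    head = sum((q + k) ** (-a) for k in range(n_terms))
    m = q + n_terms
    return head + m ** (1.0 - a) / (a - 1.0) + 0.5 * m ** (-a)

def log_likelihood(mu, sizes, s_min=1):
    """Log-likelihood of p(s) = s^-(1+mu) / zeta(1+mu, s_min), s >= s_min."""
    tail = [s for s in sizes if s >= s_min]
    a = 1.0 + mu
    return (-len(tail) * math.log(hurwitz_zeta(a, s_min))
            - a * sum(math.log(s) for s in tail))

def fit_mu(sizes, s_min=1, lo=0.05, hi=3.0, tol=1e-6):
    """Maximize the (concave) log-likelihood by golden-section search."""
    g = (math.sqrt(5.0) - 1.0) / 2.0
    while hi - lo > tol:
        c, d = hi - g * (hi - lo), lo + g * (hi - lo)
        if log_likelihood(c, sizes, s_min) > log_likelihood(d, sizes, s_min):
            hi = d
        else:
            lo = c
    return 0.5 * (lo + hi)

# Invented toy sample of project sizes (not Amazee data).
toy_sizes = ([1] * 60 + [2] * 18 + [3] * 9 + [4] * 5 + [7] * 3
             + [15] * 2 + [40, 120, 1100])
for s_min in (1, 2, 3):   # scan of the lower threshold
    print(s_min, round(fit_mu(toy_sizes, s_min), 3))
```

An estimate that stays roughly constant as `s_min` increases is the kind of stability referred to in the text. Confidence intervals would require an additional resampling step that is not shown here.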



TABLE II: For each of the four snapshots of the Amazee database, we report the parameters $r$,
$\sigma$ and $h$ as explained in the text. Inserting these parameters in expression (3) yields
the predicted exponents $\mu$ (TH), which are compared with the empirical exponents $\mu$
estimated by maximum likelihood (MLE). For each cell value, the 95% confidence interval is
obtained by bootstrapping over 10,000 realizations. Dates are in format day/month/year.

Date          07.08.2008       08.02.2009        07.08.2009        08.03.2010

$r$           0.11             0.031             0.027             0.019
              [0.074, 0.20]    [0.027, 0.036]    [0.024, 0.031]    [0.017, 0.021]

$\sigma$      0.30             0.18              0.18              0.19
              [0.23, 0.41]     [0.16, 0.20]      [0.16, 0.20]      [0.15, 0.24]

$h$           0.096            0.021             0.017             0.011
              [0.065, 0.17]    [0.019, 0.025]    [0.015, 0.019]    [0.0099, 0.012]

$\mu$ (MLE)   0.64             0.71              0.73              0.76
              [0.58, 0.70]     [0.67, 0.76]      [0.69, 0.78]      [0.72, 0.80]

$\mu$ (TH)    0.89             0.78              0.73              0.75
              [0.78, 1.05]     [0.74, 0.81]      [0.70, 0.75]      [0.71, 0.79]


We now test formula (3). For this, we test if model (2) holds and proceed to estimating the
parameters $r$, $\sigma$ and $h$. The proportional growth model posits that, for sufficiently
small time intervals $\Delta t$, the mean $E[\Delta S]$ and the standard deviation
$\sigma_{\Delta S}$ of the increment of the size $S$ of a given project should both be
proportional to $S$. To test this proposition, all the Amazee projects are pooled together in
100 size intervals over all four snapshots. For each of the 100 size intervals, Figure 2 plots
the average daily increase of project sizes ($E[\Delta S]$) and its standard deviation
$\sigma_{\Delta S}$ as a function of $S$. Linear regressions give very high $R^2$'s, larger than
0.995, confirming that Gibrat's law holds. The parameters $r$ and $\sigma$ are estimated as the
mean and standard deviation of the set of daily growth rates, and are reported in Table II.
Note that $\sigma_{\Delta S}$ is much larger than $E[\Delta S]$, i.e., the stochastic component
of the proportional growth clearly dominates (an essential condition for a power law to emerge
in the model [1]).
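
The binning test of proportional growth can be illustrated on synthetic data. The sketch below is not the authors' pipeline. It assumes logarithmically spaced size bins and least-squares lines through the origin, the function name `gibrat_slopes` is hypothetical, and the data are generated from an exactly proportional model with illustrative values `r_true` and `sigma_true`.

```python
import math
import random
import statistics

def gibrat_slopes(sizes, increments, n_bins=20):
    """Through-origin slopes of mean(dS) and std(dS) against mean(S) per bin."""
    lo, hi = math.log(min(sizes)), math.log(max(sizes))
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for s, ds in zip(sizes, increments):
        idx = min(int((math.log(s) - lo) / width), n_bins - 1)
        bins[idx].append((s, ds))
    xs, means, stds = [], [], []
    for b in bins:
        if len(b) < 2:
            continue  # skip bins too sparse for a standard deviation
        xs.append(statistics.fmean([s for s, _ in b]))
        means.append(statistics.fmean([ds for _, ds in b]))
        stds.append(statistics.stdev([ds for _, ds in b]))
    sxx = sum(x * x for x in xs)
    slope_mean = sum(x * y for x, y in zip(xs, means)) / sxx
    slope_std = sum(x * y for x, y in zip(xs, stds)) / sxx
    return slope_mean, slope_std

# Synthetic daily increments obeying Gibrat's rule (illustrative values only).
rng = random.Random(0)
r_true, sigma_true = 0.02, 0.19
sizes = [math.exp(rng.uniform(0.0, math.log(1000.0))) for _ in range(20000)]
increments = [s * (r_true + sigma_true * rng.gauss(0.0, 1.0)) for s in sizes]
r_hat, sigma_hat = gibrat_slopes(sizes, increments)
print(round(r_hat, 3), round(sigma_hat, 3))
```

If growth is proportional, both slopes are well defined and recover the drift and the volatility. A systematic curvature of the binned points away from a straight line would instead signal a violation of Gibrat's rule.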

Next, we find that the rate of birth of new projects on Amazee.com is approximately described
by a Poisson process, such that the probability that $n$ projects are born in a given day is
given by

$$\Pr\{n\} = \frac{\lambda^{n}}{n!}\,e^{-\lambda}, \qquad (5)$$

where $\lambda \approx 2.4$ is the mean number of newborn projects per day.
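
As a quick illustration of equation (5), the sketch below evaluates the Poisson probabilities for the reported mean of about 2.4 births per day. The function name `poisson_pmf` is hypothetical.

```python
import math

def poisson_pmf(n, lam=2.4):
    """Probability of n project births in one day under equation (5)."""
    return lam ** n / math.factorial(n) * math.exp(-lam)

for n in range(6):
    print(n, round(poisson_pmf(n), 4))
```

Comparing these probabilities with the observed fraction of days on which $n$ projects were created is one simple way to check the Poisson description.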

Many projects eventually stop growing, when they have reached their goals or in the presence
of operational problems. We qualify a project as "dead" at some time $t_d$ if it has not added
any new member for the 90 days following $t_d$. If born at some time $t_b$, its lifetime $\ell$
is then calculated as $\ell := t_d - t_b$. For projects with lifetimes of 12 days or larger, we
find that the distribution of project lifetimes $\ell$ is very well approximated by the
exponential law

$$\Pr\{\ell > T\} = e^{-hT}, \qquad (6)$$

where $h$ is the death hazard rate, whose maximum likelihood estimations are reported in
Table II for the four snapshots of Amazee's database. A Kolmogorov-Smirnov test applied to (6)
gives a p-value (estimated by bootstrap) of 93.7%, confirming the exponential model (6).
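
The estimation of the hazard rate can be sketched as follows, on synthetic lifetimes. This is an illustration only, with hypothetical function names `fit_hazard` and `ks_distance`. It assumes that lifetimes of at least 12 days are fitted through their excess over 12 days (justified by the memoryless property of the exponential law), and it computes the K-S distance but not the bootstrap p-value.

```python
import math
import random

def fit_hazard(lifetimes, t_min=12.0):
    """MLE of h for lifetimes >= t_min, using the excess over t_min."""
    excess = [t - t_min for t in lifetimes if t >= t_min]
    return len(excess) / sum(excess), excess

def ks_distance(excess, h):
    """Largest gap between the empirical CDF and 1 - exp(-h t)."""
    xs = sorted(excess)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        cdf = 1.0 - math.exp(-h * x)
        d = max(d, abs(cdf - i / n), abs((i + 1) / n - cdf))
    return d

# Synthetic lifetimes in days (illustrative hazard rate, not Amazee data).
rng = random.Random(0)
h_true = 0.011
lifetimes = [rng.expovariate(h_true) for _ in range(5000)]
h_hat, excess = fit_hazard(lifetimes)
print(round(h_hat, 4), round(ks_distance(excess, h_hat), 4))
```

For an exponential law the maximum likelihood estimate is simply the inverse of the mean excess lifetime, and a small K-S distance indicates that the fitted exponential describes the sample well.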

Using the empirically determined values of $r$, $\sigma$ and $h$, we are now in a position to
test the theoretical prediction (3) for the exponents $\mu$ of the proportional growth model in
the presence of stochastic birth and death processes. As shown in Table II, except for the
first snapshot for which transient effects are present (as discussed before), the agreement is
excellent, with no adjustable parameters!

The detailed empirical analysis of the burgeoning social networks on Amazee has provided a
unique set-up to test predicted deviations from Zipf's law in a system in which all ingredients
needed for Zipf's law to apply are verifiable and verified. The deviation from Zipf's law,
namely that the exponent $\mu$ is smaller than 1, results from the fact that the average growth
rate $r$ of Amazee projects is higher than their death rate $h$. Hence, the deviation from
Zipf's law is a remarkable statistical signature of the overall non-stationary growth of the
Amazee universe.

After their time of fame and fashion, power law distributions have sometimes been decried as
too general, perhaps too universal to really provide useful insights. Here, we have provided an
example in which the value of the exponent, and in particular the fact that it is smaller than
1, is a direct fingerprint of the overall growth of a social system, under the combined actions
of multiplicative noise, birth and death processes. Given the generality of these ingredients,
the prediction of the power law exponents provides a new understanding of power law
distributions, which should prove insightful for many natural, economic and social systems.